$$ \newcommand\de{\mathrel{\bullet\mkern-3mu{\rightarrow}}} \newcommand\ue{\mathrel{\bullet\mkern-2mu{-}\mkern-3mu\bullet}} \newcommand\ne{\mathrel{\;\mkern3mu}\;} $$

1 (Social) Network Data

Today we're going to demonstrate simple ways of loading network data, visualising it and analysing its structure, and show how this can help answer research questions.

A recurring point I'll make today is that lots of systems can be represented as a network. Perhaps you're interested in social networks like Twitter (Murthy 2012), but other types of networks could be of interest too (Tero et al. 2010).

1.1 What is a Network?

A network is a way of representing how things are connected (or not). Networks can be social (who tweets to whom), economic (which companies employ which people), engineering (which parts were used in each product), etc. The representation of these connections is a network--also called a graph in mathematics (confusing eh?)--and in the social sciences it's a tool often applied to help quantitatively (and sometimes qualitatively) answer research questions (Heath, Fuller, and Johnston 2009).

1.2 Installing R and igraph

We're going to focus on the igraph R package today. There are many other options but igraph is a fairly comprehensive package for getting started. If you're looking to go beyond what we cover today, I recommend looking through the igraph documentation for your particular interests before trying the other packages, and feel free to email me griffith.rees@sheffield.ac.uk if you've got questions. Please ask detailed questions, demonstrating what you're trying to do and what you've tried so far so I can efficiently reply.

For those that have got igraph installed feel free to continue further down this handout to play around with the data.

Everyone else: please download RStudio from https://rstudio.com/products/rstudio/download/#download.

It should fit your operating system automatically (Windows, Linux or Mac). If you're not seeing an option (possibly old versions of Windows, macOS, or unusual Linux distributions) message me.1

Please fill in the survey when you're done or post messages if you're having issues.

Now let's install https://igraph.org/r/

install.packages("igraph")

and then load it in your R session.

library(igraph)

1.3 Test loading igraph data

We're going to jump into visualisation with igraph as a test of the install and a demonstration of visualisation options. Then we'll break it down into what's going on underneath but again, feel free to mess around with what's loaded as we continue on.

We begin by loading node and edge data from Dr Evelina Gabašová's excellent dataset on which characters appeared together in scenes of Star Wars films. Shamelessly borrowing from an NYU short course created by Dr Pablo Barberá, we focus on Episode IV - A New Hope.

You can have a look at my github repository for this course: https://github.com/griff-rees/network-analysis-course and download the repository. That includes the code for this handout and the data we're playing around with today.

Once you've downloaded and unzipped it, have a look in the data folder to make sure there are 4 csv files, including star-wars-network-edges.csv and star-wars-network-nodes.csv. The originals can be found at https://github.com/pablobarbera/data-science-workshop/tree/master/sna/data.

1.3.1 Load Node/Edge CSVs

First load the csv of nodes and have it print out the list of records.

nodes <- read.csv("data/star-wars-network-nodes.csv")
nodes
##           name id
## 1        R2-D2  0
## 2    CHEWBACCA  1
## 3        C-3PO  2
## 4         LUKE  3
## 5  DARTH VADER  4
## 6        CAMIE  5
## 7        BIGGS  6
## 8         LEIA  7
## 9         BERU  8
## 10        OWEN  9
## 11     OBI-WAN 10
## 12       MOTTI 11
## 13      TARKIN 12
## 14         HAN 13
## 15      GREEDO 14
## 16       JABBA 15
## 17     DODONNA 16
## 18 GOLD LEADER 17
## 19       WEDGE 18
## 20  RED LEADER 19
## 21     RED TEN 20
## 22   GOLD FIVE 21

You should see two columns: a list of character names from the film alongside a list of id numbers. We'll look at this in detail later.

  • name (character names)
  • id (integer)

Next load a list of edges:

edges <- read.csv("data/star-wars-network-edges.csv")
edges
##         source      target weight
## 1        C-3PO       R2-D2     17
## 2         LUKE       R2-D2     13
## 3      OBI-WAN       R2-D2      6
## 4         LEIA       R2-D2      5
## 5          HAN       R2-D2      5
## 6    CHEWBACCA       R2-D2      3
## 7      DODONNA       R2-D2      1
## 8    CHEWBACCA     OBI-WAN      7
## 9        C-3PO   CHEWBACCA      5
## 10   CHEWBACCA        LUKE     16
## 11   CHEWBACCA         HAN     19
## 12   CHEWBACCA        LEIA     11
## 13   CHEWBACCA DARTH VADER      1
## 14   CHEWBACCA     DODONNA      1
## 15       CAMIE        LUKE      2
## 16       BIGGS       CAMIE      2
## 17       BIGGS        LUKE      4
## 18 DARTH VADER        LEIA      1
## 19        BERU        LUKE      3
## 20        BERU        OWEN      3
## 21        BERU       C-3PO      2
## 22        LUKE        OWEN      3
## 23       C-3PO        LUKE     18
## 24       C-3PO        OWEN      2
## 25       C-3PO        LEIA      6
## 26        LEIA        LUKE     17
## 27        BERU        LEIA      1
## 28        LUKE     OBI-WAN     19
## 29       C-3PO     OBI-WAN      6
## 30        LEIA     OBI-WAN      1
## 31       MOTTI      TARKIN      2
## 32 DARTH VADER       MOTTI      1
## 33 DARTH VADER      TARKIN      7
## 34         HAN     OBI-WAN      9
## 35         HAN        LUKE     26
## 36      GREEDO         HAN      1
## 37         HAN       JABBA      1
## 38       C-3PO         HAN      6
## 39        LEIA       MOTTI      1
## 40        LEIA      TARKIN      1
## 41         HAN        LEIA     13
## 42 DARTH VADER     OBI-WAN      1
## 43     DODONNA GOLD LEADER      1
## 44     DODONNA       WEDGE      1
## 45     DODONNA        LUKE      1
## 46 GOLD LEADER       WEDGE      1
## 47 GOLD LEADER        LUKE      1
## 48        LUKE       WEDGE      2
## 49       BIGGS        LEIA      1
## 50        LEIA  RED LEADER      1
## 51        LUKE  RED LEADER      3
## 52       BIGGS  RED LEADER      3
## 53       BIGGS       C-3PO      1
## 54       C-3PO  RED LEADER      1
## 55  RED LEADER       WEDGE      3
## 56 GOLD LEADER  RED LEADER      1
## 57       BIGGS       WEDGE      2
## 58  RED LEADER     RED TEN      1
## 59       BIGGS GOLD LEADER      1
## 60        LUKE     RED TEN      1

Again we'll go into this in further detail but for now it's just good to know everyone's got these up and working. You should see three columns:

  • source (a character's name)
  • target (another character's name)
  • weight (an integer for the number of scenes they share)

There are many ways of storing network information, and we'll look at other options later. For now: note the nodes are the names of characters and the edges are pairs of characters followed by the number of scenes they share in the film.

1.3.2 Creating a network

g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)

We'll look at how this works in detail in the next section, but note here that we load the data.frame of edges into the d parameter and the data.frame of nodes (in this case characters) as vertices (another name for nodes in network analysis). We're setting directed=FALSE for simplicity (more on this later).
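As a rough sketch of how the two tables map onto the graph, the example below builds a three-node graph from a hand-typed excerpt of the tables above. The names nodes_small, edges_small and g_small are made up for this illustration, and it assumes igraph is installed.

```r
library(igraph)

# Hand-typed excerpt of the node and edge tables shown above
nodes_small <- data.frame(name = c("R2-D2", "C-3PO", "LUKE"),
                          id = c(0, 2, 3),
                          stringsAsFactors = FALSE)
edges_small <- data.frame(source = c("C-3PO", "LUKE", "C-3PO"),
                          target = c("R2-D2", "R2-D2", "LUKE"),
                          weight = c(17, 13, 18),
                          stringsAsFactors = FALSE)

g_small <- graph_from_data_frame(d = edges_small, vertices = nodes_small,
                                 directed = FALSE)

# The first column of `vertices` becomes the node name; the remaining
# columns become node attributes
vertex_attr_names(g_small)

# The first two columns of `d` define the edges; the remaining columns
# become edge attributes
edge_attr_names(g_small)
```

Running the same two calls on g should list the attributes of the full network.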

1.3.3 Visualising a network

There are lots of ways to visualise networks. In part as a test of setup, we're jumping in to demonstrate how visualisation works, make sure everything's installed correctly and ensure the data is loading correctly as well.

plot(g)

Mostly illegible eh? Your plot will have a different layout but as long as it looks similar we should be ready to go.

Even with this rough layout we can, however, answer a research question:

Which character shares the fewest scenes with any other?

A: R2-D2 B: Gold Five C: Leia D: Jabba E: Greedo

Sneak preview of what's to come:

2 Networks (aka Graphs)

Now that we've got the basic package and a dataset up, let's dive into the details of what's going on here. The goal is to give you enough of a foundation in network analysis to see how it can help answer research questions. It's also worth mentioning that in mathematics networks are often called graphs, and fit under the umbrella term graph theory, hence the package name igraph. I'll use the terms graph and network interchangeably, while a visualisation of a network I'll try to call a plot (apologies in advance if I call a visualisation a graph!).

The basic components of networks aka graphs are nodes and edges.

2.1 Nodes

In the dataset we're using, Star Wars characters are nodes, also known as vertices (or vertex if singular). This is analogous to records in classic datasets like people or firms, which can include characteristics such as age or income. Nodes can have different characteristics too--usually called node attributes--and these can help in analysing and understanding the network.

2.1.1 Node Attributes

Let's take a deeper look at the data we used to create the plot above:

nodes

As mentioned in a previous section there are two components of data for each of these nodes:

  • name (character names)
  • id (integer)

To see how that shows up when loaded into an igraph network use the V (vertices, synonymous with nodes) function:

V(g)
## + 22/22 vertices, named, from 9c6a611:
##  [1] R2-D2       CHEWBACCA   C-3PO       LUKE        DARTH VADER CAMIE      
##  [7] BIGGS       LEIA        BERU        OWEN        OBI-WAN     MOTTI      
## [13] TARKIN      HAN         GREEDO      JABBA       DODONNA     GOLD LEADER
## [19] WEDGE       RED LEADER  RED TEN     GOLD FIVE

This lists the vertices/nodes, the id it holds in memory and how many there are. The name attribute is particularly helpful in graph visualisation as it shows up automatically with igraph plots. If you've not used R before, here's a useful way to look at parts of data (usually columns in a table) using the $ symbol after the variable name, followed by the column name (like in the nodes variable) or the attribute name (like in the g network variable).

nodes$name
V(g)$name

We also get a list of all attributes with

vertex_attr(g)
## $name
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"  
## 
## $id
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21

2.1.2 Adding Attributes

Let's add another attribute. Following the classic notion of the force in Star Wars we can add data indicating which side of the force characters are associated with (or other if neither). First we create vectors of names using the c (combine) function to get a vector of strings (names of nodes in this case).

# Create a vector of character names for each side of the force
dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")
light_side <- c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "CAMIE", "BIGGS",
                "LEIA", "BERU", "OWEN", "OBI-WAN", "HAN", "DODONNA",
                "GOLD LEADER", "WEDGE", "RED LEADER", "RED TEN", "GOLD FIVE")
neutral <- c("GREEDO", "JABBA")

Now we can add colours to the nodes based on the vectors of names we just created. R has a set of colour names that can be used in plots, so feel free to pick your own. By storing one of these colour names in the new color attribute, we can have that directly show up in the visualisation. Note the spelling of color: if you add to a colour column the data is stored but may not automatically affect the plot.

To add the color attribute, first create the column with an NA (refers to 'not available') value

# Add the color attribute to the network nodes
V(g)$color <- NA  # Initialise the new 'color' attribute as NA for all nodes

then fill it up using the categories saved above. If you also run V(g)$color in between adding these you'll see how the $color column gets populated

V(g)$color[V(g)$name %in% dark_side] <- "red" # set the dark side color name to red
V(g)$color
##  [1] NA    NA    NA    NA    "red" NA    NA    NA    NA    NA    NA    "red"
## [13] "red" NA    NA    NA    NA    NA    NA    NA    NA    NA

To break this down: the %in% operation tests if data in one column---in this case the $name variable which is to the left of %in%---matches with (is in) the options on the right of the %in%. Where it does match it returns TRUE, and elsewhere it returns FALSE.

V(g)$color[V(g)$name %in% light_side] <- "gold" # set the light side color name to gold
V(g)$color[V(g)$name %in% neutral] <- "green" # set the color of neutral characters to green
V(g)$color
##  [1] "gold"  "gold"  "gold"  "gold"  "red"   "gold"  "gold"  "gold"  "gold" 
## [10] "gold"  "gold"  "red"   "red"   "gold"  "green" "green" "gold"  "gold" 
## [19] "gold"  "gold"  "gold"  "gold"

In this case that means the intended colour value gets saved in the places where names match the dark, light and neutral sides. It's complicated to explain but extremely useful.
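To see the two steps separately, here is a minimal base R sketch using only the first six character names from the node table. The variable names character_names, is_dark and colours are made up for this illustration.

```r
# The first six character names from the node table above
character_names <- c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "DARTH VADER", "CAMIE")
dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")

# %in% returns one TRUE/FALSE per element on its left-hand side
is_dark <- character_names %in% dark_side
is_dark

# Using that logical vector inside [ ] assigns only where it is TRUE
colours <- rep(NA_character_, length(character_names))
colours[is_dark] <- "red"
colours
```

Only the fifth entry (DARTH VADER) is TRUE, so only the fifth colour is set and the rest stay NA.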

2.1.3 Subgraphs

These attributes can help us look at subgraphs: portions of the graph such as just the dark_side:

dark_side_graph <- induced_subgraph(g, dark_side) # Using the dark_side variable from above
V(dark_side_graph)
## + 3/3 vertices, named, from bef6366:
## [1] DARTH VADER MOTTI       TARKIN

This raises an important point: so many aspects of the world can be thought of as a network, and just as populations are sampled to make summary claims---such as income distribution or age---networks are often sampled for analysis. And just as we need to be up front about sampling methods in many research contexts, we need to be aware that often there are portions of networks we cannot observe. Those sections may be very important, and in failing to observe them we can end up with very different structures and very different results.

In the classic sense of probability theory the law of large numbers suggests that often 1000 trials of an experiment, or random sampling from a population (often needing weighting to be accurate), can lead to representative results for the whole population. This can unfortunately be very difficult to manage in the case of networks (Browne 2005).

So: often quantitative analysis of networks involves subgraphs, and it's worth being aware of that when analysing. Keep that in mind: we'll come back to this.
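A small sketch of what a subgraph leaves out: the excerpt below, hand-typed from the edge table, holds the three dark side characters plus LEIA, who shares a scene with each of them. The names edges_small, g_small and dark_small are made up for this illustration, and it assumes igraph is installed.

```r
library(igraph)

# Hand-typed excerpt of the edge table above: the three dark side
# characters plus LEIA, who shares a scene with each of them
edges_small <- data.frame(
  source = c("MOTTI", "DARTH VADER", "DARTH VADER", "DARTH VADER", "LEIA", "LEIA"),
  target = c("TARKIN", "MOTTI", "TARKIN", "LEIA", "MOTTI", "TARKIN"),
  weight = c(2, 1, 7, 1, 1, 1),
  stringsAsFactors = FALSE)
g_small <- graph_from_data_frame(edges_small, directed = FALSE)

dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")
dark_small <- induced_subgraph(g_small, dark_side)

ecount(g_small)     # edges in the excerpt
ecount(dark_small)  # edges left once LEIA is dropped
```

Half of the edges in this excerpt disappear with a single node, which is the kind of loss to keep in mind when a network is only partly observed.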

2.1.4 Visualising

It's worth acknowledging that we're focusing on specific attributes that help with visualisation. I got frustrated preparing some of these slides because I tried doing something like this but with variations in spelling...

# load the data to a new variable (f) in the same way we loaded for g prior to adding colour
f <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)

# Add a colour attribute using the British rather than American spelling...
V(f)$colour <- NA  # Initialise the new 'colour' attribute as NA for all nodes
V(f)$colour[V(f)$name %in% dark_side] <- "red" # set the dark side colour name to red
V(f)$colour[V(f)$name %in% light_side] <- "gold" # set the light side colour name to gold
V(f)$colour[V(f)$name %in% neutral] <- "green" # set the colour of neutral characters to green
plot(f)

Notice no difference in colour. Now compare that with

plot(g)

This should illustrate two things:

  • You can add lots of attributes to nodes in a way similar to adding a column of data to a normal data.frame like the nodes variable
  • Some attributes can have a special role in particular network packages, such as color in igraph
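If you would rather keep an attribute under your own name, such as colour, one option is to hand its values to plot through the vertex.color argument. The sketch below assumes igraph is installed and uses a made-up four-node excerpt called f_small.

```r
library(igraph)

# Three edges hand-typed from the edge table above
edges_small <- data.frame(source = c("DARTH VADER", "DARTH VADER", "LEIA"),
                          target = c("TARKIN", "LEIA", "LUKE"),
                          stringsAsFactors = FALSE)
f_small <- graph_from_data_frame(edges_small, directed = FALSE)

# Store the colours under the British spelling, which plot() does not pick up
V(f_small)$colour <- "gold"
V(f_small)$colour[V(f_small)$name %in% c("DARTH VADER", "TARKIN")] <- "red"

# Passing the values through the vertex.color argument applies them anyway
plot(f_small, vertex.color = V(f_small)$colour)
```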

Similarly: you can visualise specific sections of a graph, and that can help by showing more detail in a form that's easier to read. Returning to the dark_side_graph subgraph

plot(dark_side_graph)

This is a lot easier to read and can, as a way of zooming in, give us a clearer picture of some aspects of the structure. This becomes more crucial with much larger network datasets. It's already difficult to read the names of characters in these plots. Let's compare it with the other side of the force:

light_side_graph <- induced_subgraph(g, light_side) # Using the light_side variable
plot(light_side_graph)

Interesting and detailed by itself, but again it's worth being aware that comparing this with the rest of the network can make a significant difference, and that a subgraph can be misleading without acknowledging the network it's sampled from.

Like many other quantitative methodologies there are many other types of attributes that we can apply to nodes such as

  • Numerical (age, body mass index, income)
  • Boolean (yes/no, sex, citizenship, neutral)
  • Categorical (ethnicity, class, nationality, sector, color as combination of side of the force or neutral)
  • Spatial (geographic position, location in hospitals)

We'll return to the applicability of these later but generally: most variables that can be used in classic statistical analysis can be applied in network analysis. It might be hard... but that data can generally be useful.

To close: be wary of sampling issues in network analysis!

2.2 Edges

Edges, also called links and ties, are the connections in networks/graphs. They can be friendship, kinship, contracts, following, liking, debt, etc. Our conversation right now is via a digital network, but if we were in a lab there would be conversations face to face, just with a lot more physical movement and scribbling on a white board.

2.2.1 Edge Data

To get started let's look at the second data file we loaded in the beginning

head(edges)
##      source target weight
## 1     C-3PO  R2-D2     17
## 2      LUKE  R2-D2     13
## 3   OBI-WAN  R2-D2      6
## 4      LEIA  R2-D2      5
## 5       HAN  R2-D2      5
## 6 CHEWBACCA  R2-D2      3

The head and tail functions are very handy ways to peek at datasets, especially very large ones. By default they return 6 records from the start or end of a data.frame respectively.

edges has three variables, the first two of which specify edges and the last is an attribute. We'll look at these in turn

2.2.1.1 Directed or Undirected

Edges can be categorised as directed or undirected. So far, we've been working with an undirected graph, and that's why we include the directed=FALSE parameter in creating g: g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE).

But it's no accident that the first two column names of the edges file are source and target. Different network packages follow different conventions (see the appendix for other packages) but directional information can be very important in networks. Someone can like someone else's tweet, just as one company can offer a contract to another or academic papers (hopefully) get cited by other papers. It's important in these cases to specify the direction of the connection, and some of the first social network analysis research, based on diagrams then called sociograms (Festinger 1954), used directional information:

Sociogram


In this study, people were individually asked to name friends at school. This diagram of friendship in a 4th grade class (US term for UK year 5) in the 1930s is a famous demonstration of a case where friendship is highly correlated with gender. If you look closely (try ctrl/cmd + to zoom in) you'll see arrows pointing at circles such as from EL to SH while in a few other cases like the connection between BR and MC there are no arrows.

This means that EL named SH as a friend but SH didn't reciprocate (also name EL as a friend), while BR and MC both named each other as friends.

Here's a newer diagram of the same network which is a bit easier to read from http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/:

Directed Network


With the additional information of the shapes in the first diagram, which map to colours in the second, this is a way of demonstrating the separation of social groups by gender and also how those groups are very weakly tied together. Weak inasmuch as only one connection is named across gender and it is not reciprocated. There are ways of quantifying how different groups are connected in networks which are outside the scope of the session today (though there's a bit later...), but there is lots of research on this in many other types of network analysis, including social (Leskovec, Lang, and Mahoney 2010).

To summarise: undirected networks have two basic states between nodes

Undirected Networks

  • Connected (a is linked to b) formalised as \[a \ue b\]
  • Unconnected (a is not linked to b) formalised as \[a \ne b\]

while directed networks have 3 basic states between nodes

Directed Networks

  • Unidirectional (connection from a to b) formalised as \[a \de b\]
  • Bidirectional (connections from a to b and b to a) formalised as \[ a \leftrightarrow b \]
  • Unconnected (a is not linked to b) formalised as above \[a \ne b\]
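The unidirectional and bidirectional states can be checked in code. The sketch below types in the three nominations described for the sociogram (EL names SH, while BR and MC name each other) and assumes igraph is installed. The names nominations and friends are made up for this illustration.

```r
library(igraph)

# Three nominations read off the sociogram: EL names SH, while BR and MC
# name each other
nominations <- data.frame(from = c("EL", "BR", "MC"),
                          to = c("SH", "MC", "BR"),
                          stringsAsFactors = FALSE)
friends <- graph_from_data_frame(nominations, directed = TRUE)

# which_mutual() flags each edge that has a partner running the other way
which_mutual(friends)
```

The first edge (EL to SH) is not flagged, while the two edges between BR and MC are.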

And returning to the way we constructed g originally:

g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)

we can also construct that network as directed by setting directed=TRUE

d <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
plot(d)

and just to help you remember, this is equivalent to leaving out the directed parameter altogether, because its default value is directed=TRUE (I've forgotten this myself many times)

d <- graph_from_data_frame(d=edges, vertices=nodes)
plot(d)

2.2.1.2 Edge Attributes

Just like nodes, edges can have attributes. The directionality in the previous section is an example of information associated with a tie, as is the presence or absence of one. One of the most common examples of tie attributes is included in the Star Wars dataset, and is usually described as a weighted tie. This is the third column in edges, labelled weight

head(edges)
##      source target weight
## 1     C-3PO  R2-D2     17
## 2      LUKE  R2-D2     13
## 3   OBI-WAN  R2-D2      6
## 4      LEIA  R2-D2      5
## 5       HAN  R2-D2      5
## 6 CHEWBACCA  R2-D2      3

This is the number of scenes that both characters share in the film. If combined with the source and target information then it's a means of weighting directed ties/edges.

With all this in mind, we can look at the original Star Wars network for a glimpse of what's included in the whole structure:

g
## IGRAPH 6a96d81 UNW- 22 60 -- 
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from 6a96d81 (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE        R2-D2      --OBI-WAN    
##  [4] R2-D2      --LEIA        R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN     CHEWBACCA  --C-3PO      
## [10] CHEWBACCA  --LUKE        CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA     LUKE       --CAMIE      
## [16] CAMIE      --BIGGS       LUKE       --BIGGS       DARTH VADER--LEIA       
## [19] LUKE       --BERU        BERU       --OWEN        C-3PO      --BERU       
## [22] LUKE       --OWEN        C-3PO      --LUKE        C-3PO      --OWEN       
## + ... omitted several edges

It's a bit technical but what's shown is a summary of the network object:

  • U means undirected
  • N means named graph (hence the name attribute)
  • W means weighted graph (hence the weight attribute)
  • 22 is the number of nodes
  • 60 is the number of edges
  • name (v/c) means name is a node attribute and it's a character (aka a string, or list of characters)
  • weight (e/n) means weight is an edge attribute and it's numeric

The rows indicate connections between nodes, so record [1] is between R2-D2 and C-3PO. Also note for those of you used to Python: R is 1 indexed rather than 0 indexed...
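The same facts can be queried one at a time instead of being read off the summary line. This is a sketch on a made-up three-edge excerpt called g_small, assuming igraph is installed; the same calls should work on g.

```r
library(igraph)

# Hand-typed excerpt of the edge table above
edges_small <- data.frame(source = c("C-3PO", "LUKE", "C-3PO"),
                          target = c("R2-D2", "R2-D2", "LUKE"),
                          weight = c(17, 13, 18),
                          stringsAsFactors = FALSE)
g_small <- graph_from_data_frame(edges_small, directed = FALSE)

is_directed(g_small)  # FALSE corresponds to the U in the summary line
is_named(g_small)     # TRUE corresponds to the N
is_weighted(g_small)  # TRUE corresponds to the W
vcount(g_small)       # number of nodes
ecount(g_small)       # number of edges
```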

Similar to the vertices V() function there is an edge function E() which prints the connections section of the summary of g just described.

E(g)
## + 60/60 edges from 6a96d81 (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE        R2-D2      --OBI-WAN    
##  [4] R2-D2      --LEIA        R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN     CHEWBACCA  --C-3PO      
## [10] CHEWBACCA  --LUKE        CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA     LUKE       --CAMIE      
## [16] CAMIE      --BIGGS       LUKE       --BIGGS       DARTH VADER--LEIA       
## [19] LUKE       --BERU        BERU       --OWEN        C-3PO      --BERU       
## [22] LUKE       --OWEN        C-3PO      --LUKE        C-3PO      --OWEN       
## [25] C-3PO      --LEIA        LUKE       --LEIA        LEIA       --BERU       
## [28] LUKE       --OBI-WAN     C-3PO      --OBI-WAN     LEIA       --OBI-WAN    
## + ... omitted several edges

Also similar to nodes, edge attributes are accessed by $:

E(g)$weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1

and the list of attributes is printed by the edge_attr function

edge_attr(g)
## $weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1

I realise this is going very quickly but the point is to demonstrate how similar accessing edge attribute information is to accessing node attributes.

With this demonstrated, we can now try adding another edge attribute. This is trickier than adding node attributes because there are so many edges (and this gets more complicated depending on whether directed or undirected) but similar to the method we used before:

E(g)$color <- "blue"
E(g)$color[E(g)$weight >= 5] <- "red"

This is a very simple case where we're colouring edges based on weight, once again using the standard colour names that come with R. We've set the color criterion to edges with a weight of five or more (20 of the 60 edges).

Once again using the edge_attr function we can see what's been added:
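To make the threshold explicit, here is a base R sketch on the first six weights from the table. It uses ifelse as an alternative way to write the same rule; the names weights and edge_colours are made up for this illustration.

```r
# The first six edge weights from the table above
weights <- c(17, 13, 6, 5, 5, 3)

# ifelse() picks a colour per edge in one step; >= includes weight 5 itself
edge_colours <- ifelse(weights >= 5, "red", "blue")
edge_colours

# table() counts how many edges fall on each side of the threshold
table(edge_colours)
```

Running table(E(g)$color) on the full network gives the same kind of count for all 60 edges.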

edge_attr(g)
## $weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18  2  6
## [26] 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1  1  2  1  1
## [51]  3  3  1  1  3  1  2  1  1  1
## 
## $color
##  [1] "red"  "red"  "red"  "red"  "red"  "blue" "blue" "red"  "red"  "red" 
## [11] "red"  "red"  "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
## [21] "blue" "blue" "red"  "blue" "red"  "red"  "blue" "red"  "red"  "blue"
## [31] "blue" "blue" "red"  "red"  "red"  "blue" "blue" "red"  "blue" "blue"
## [41] "red"  "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
## [51] "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"

And similar to node attributes, color has the extra significance of automatically being added to plots (though with the current layout the red edges might not stand out obviously on your screen).

plot(g)

Hope it's clear on some screens. We've obviously got issues with positioning which is to come. We're very close to finishing the edge section. We've got one more part to cover then on to the details of visualisation.

But before we do, once again we can use this visualisation (and the weight attribute) to answer a question: which pair of characters share the most scenes in Star Wars IV?
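If the plot is too crowded to read the answer, the edge table can be queried directly. The base R sketch below uses a made-up three-row excerpt called edges_small so it doesn't give the answer away; swapping in edges should answer the question for the whole film.

```r
# Hand-typed excerpt of the edge table above
edges_small <- data.frame(source = c("C-3PO", "LUKE", "C-3PO"),
                          target = c("R2-D2", "R2-D2", "LUKE"),
                          weight = c(17, 13, 18),
                          stringsAsFactors = FALSE)

# which.max() returns the row number of the largest weight
top_row <- which.max(edges_small$weight)
edges_small[top_row, ]

# Sorting by decreasing weight gives the full ranking instead
edges_small[order(edges_small$weight, decreasing = TRUE), ]
```

Note that which.max returns only the first row if several edges tie for the largest weight.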

2.2.1.3 Adjacency Matrix

So far the edges have all been given as lists of pairs of node names, but another way of representing edges is via a matrix, and that features a lot in other network packages. I'm just going to briefly show you that now

g[]
## 22 x 22 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 22 column names 'R2-D2', 'CHEWBACCA', 'C-3PO' ... ]]
##                                                               
## R2-D2        .  3 17 13 . . .  5 . .  6 . .  5 . . 1 . . . . .
## CHEWBACCA    3  .  5 16 1 . . 11 . .  7 . . 19 . . 1 . . . . .
## C-3PO       17  5  . 18 . . 1  6 2 2  6 . .  6 . . . . . 1 . .
## LUKE        13 16 18  . . 2 4 17 3 3 19 . . 26 . . 1 1 2 3 1 .
## DARTH VADER  .  1  .  . . . .  1 . .  1 1 7  . . . . . . . . .
## CAMIE        .  .  .  2 . . 2  . . .  . . .  . . . . . . . . .
## BIGGS        .  .  1  4 . 2 .  1 . .  . . .  . . . . 1 2 3 . .
## LEIA         5 11  6 17 1 . 1  . 1 .  1 1 1 13 . . . . . 1 . .
## BERU         .  .  2  3 . . .  1 . 3  . . .  . . . . . . . . .
## OWEN         .  .  2  3 . . .  . 3 .  . . .  . . . . . . . . .
## OBI-WAN      6  7  6 19 1 . .  1 . .  . . .  9 . . . . . . . .
## MOTTI        .  .  .  . 1 . .  1 . .  . . 2  . . . . . . . . .
## TARKIN       .  .  .  . 7 . .  1 . .  . 2 .  . . . . . . . . .
## HAN          5 19  6 26 . . . 13 . .  9 . .  . 1 1 . . . . . .
## GREEDO       .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## JABBA        .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## DODONNA      1  1  .  1 . . .  . . .  . . .  . . . . 1 1 . . .
## GOLD LEADER  .  .  .  1 . . 1  . . .  . . .  . . . 1 . 1 1 . .
## WEDGE        .  .  .  2 . . 2  . . .  . . .  . . . 1 1 . 3 . .
## RED LEADER   .  .  1  3 . . 3  1 . .  . . .  . . . . 1 3 . 1 .
## RED TEN      .  .  .  1 . . .  . . .  . . .  . . . . . . 1 . .
## GOLD FIVE    .  .  .  . . . .  . . .  . . .  . . . . . . . . .

For those of you interested in the details, the matrix is symmetrical if undirected and potentially asymmetrical if directed. Each row and column is a list of connections between one node and all the others (including itself, which is the diagonal down the middle). I'm going to leave that there for today, but this should give you some idea of where it comes from and what it means if you see a package asking for an adjacency matrix.

As an exercise: spot the differences to the directed adjacency matrix:
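To show where such a matrix comes from, here is a base R sketch that builds a small weighted adjacency matrix by hand from a three-row excerpt of the edge table. It is an illustration rather than how igraph does it internally, and the names edges_small, node_names and adj are made up. Unlike the sparse print-out above, empty cells appear as 0 rather than a dot.

```r
# Hand-typed excerpt of the edge table above
edges_small <- data.frame(source = c("C-3PO", "LUKE", "C-3PO"),
                          target = c("R2-D2", "R2-D2", "LUKE"),
                          weight = c(17, 13, 18),
                          stringsAsFactors = FALSE)
node_names <- c("R2-D2", "C-3PO", "LUKE")

# Start from an all-zero square matrix with one row and column per node
adj <- matrix(0, nrow = length(node_names), ncol = length(node_names),
              dimnames = list(node_names, node_names))

# Undirected: write each weight into both (source, target) and (target, source)
for (i in seq_len(nrow(edges_small))) {
  src <- edges_small$source[i]
  tgt <- edges_small$target[i]
  adj[src, tgt] <- edges_small$weight[i]
  adj[tgt, src] <- edges_small$weight[i]
}

adj
isSymmetric(adj)  # checks that the matrix equals its own transpose
```

Dropping the second assignment inside the loop would give the directed version, where only the source row is filled in.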

d[]
## 22 x 22 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 22 column names 'R2-D2', 'CHEWBACCA', 'C-3PO' ... ]]
##                                                             
## R2-D2        . . .  . . . .  . . .  . . .  . . . . . . . . .
## CHEWBACCA    3 . . 16 1 . . 11 . .  7 . . 19 . . 1 . . . . .
## C-3PO       17 5 . 18 . . .  6 . 2  6 . .  6 . . . . . 1 . .
## LUKE        13 . .  . . . .  . . 3 19 . .  . . . . . 2 3 1 .
## DARTH VADER  . . .  . . . .  1 . .  1 1 7  . . . . . . . . .
## CAMIE        . . .  2 . . .  . . .  . . .  . . . . . . . . .
## BIGGS        . . 1  4 . 2 .  1 . .  . . .  . . . . 1 2 3 . .
## LEIA         5 . . 17 . . .  . . .  1 1 1  . . . . . . 1 . .
## BERU         . . 2  3 . . .  1 . 3  . . .  . . . . . . . . .
## OWEN         . . .  . . . .  . . .  . . .  . . . . . . . . .
## OBI-WAN      6 . .  . . . .  . . .  . . .  . . . . . . . . .
## MOTTI        . . .  . . . .  . . .  . . 2  . . . . . . . . .
## TARKIN       . . .  . . . .  . . .  . . .  . . . . . . . . .
## HAN          5 . . 26 . . . 13 . .  9 . .  . . 1 . . . . . .
## GREEDO       . . .  . . . .  . . .  . . .  1 . . . . . . . .
## JABBA        . . .  . . . .  . . .  . . .  . . . . . . . . .
## DODONNA      1 . .  1 . . .  . . .  . . .  . . . . 1 1 . . .
## GOLD LEADER  . . .  1 . . .  . . .  . . .  . . . . . 1 1 . .
## WEDGE        . . .  . . . .  . . .  . . .  . . . . . . . . .
## RED LEADER   . . .  . . . .  . . .  . . .  . . . . . 3 . 1 .
## RED TEN      . . .  . . . .  . . .  . . .  . . . . . . . . .
## GOLD FIVE    . . .  . . . .  . . .  . . .  . . . . . . . . .

Note: there aren't any cases of links to oneself in the Star Wars example (which wouldn't make sense in a film... except for maybe a sci-fi time travel one) but in other cases that can be applicable, such as emailing oneself (something I do all the time as reminders).

2.3 Visualisation

Finally down to visualisation! We've obviously spent almost all this time on constructing the graph and understanding the data that composes it. But what's really annoying throughout (and generally a very hard problem, for which there are many packages) is getting a good visual presentation of a network. There's lots of work on this, but just to give you a taste we're going to dive into the plot function and its layout parameter, keeping the edge highlighting we've been working on.

2.3.1 Basic Layout options

par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")

These are just some of the layout options to choose from. layout_with_fr is a favourite of many, so that's what I used at the start. To get closer to what I showed at the start:

plot(g, layout=layout_with_fr, main="Force-directed")

Note that the arrangement of the plotting grid was specified with par() for the six-panel example, but for the individual example I've left it blank (default). Layouts such as the random and force-directed ones depend on a random number generator, so everyone's plot will look a bit different. We can reproduce the same arrangement by specifying a seed (coming up below).

2.3.2 Layout with node colouring and specific seed

Bringing the node colouring back in, we can make the most of node and edge attributes.

V(g)$color <- NA
V(g)$color[V(g)$name %in% dark_side] <- "red"    # match on g's own node names
V(g)$color[V(g)$name %in% light_side] <- "gold"
V(g)$color[V(g)$name %in% neutral] <- "green"
par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")

set.seed(3339)  # Specify a seed for reproducing network layout

plot(g, layout=layout_with_fr, main="Force-directed")
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"), 
       pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")

Here we've now specified the set.seed(3339) command, which means that the random number generator should produce the same results on your screen.
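
A minimal sketch of why the seed matters, using a hypothetical ring graph (toy) as a stand-in for g. It also shows an alternative that isn't used elsewhere in this handout: computing the layout once, storing the coordinates, and passing them to plot, which keeps the arrangement fixed without re-seeding.

```r
library(igraph)

toy <- make_ring(10)  # hypothetical stand-in for g

# Same seed, same force-directed coordinates
set.seed(3339)
coords_a <- layout_with_fr(toy)
set.seed(3339)
coords_b <- layout_with_fr(toy)
isTRUE(all.equal(coords_a, coords_b))  # TRUE

# A layout is just a matrix with one row of x/y coordinates per node,
# so it can be stored and reused across plots
dim(coords_a)
plot(toy, layout = coords_a, main = "Reused layout")
```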

2.3.3 Interactive layout

But: should we just keep trying different numbers until it looks good? That's a really annoying problem, and a reason why there are many options for visualising networks. igraph has a basic option to help with this, but it can be difficult to set up and at the moment it's more likely to work on Windows and Linux than macOS. There is a workaround for macOS: https://github.com/sethrfore/homebrew-r-srf but now's not the time to try.

set.seed(3339)  # Specify a seed for reproducing network layout

tkplot(g)
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"), 
       pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")

snahelper (included in the install above) is another very new option (apologies for the lack of warning). It's a collection of tools which seem to work across platforms and provide sophisticated options for altering graphs interactively.

library(snahelper)
SNAhelperGadget(g)

Other options include:

  • ggnetwork: R package for use with ggplot, related to snahelper
  • gephi: a standalone package for interactive arrangements https://gephi.org/

3 Lab 1

We've covered loading and basic visualisation of network data. To see how it works on data that might be closer to research you're interested in, we're going to run a workshop with another, more real-world dataset from Twitter.

In the folder with the Star Wars data you should see two data files called congress-twitter-network-edges.csv and congress-twitter-network-nodes.csv.

Your task, should you choose to accept it, is to load these datafiles, visualise them, and have a go at answering a research question. Hint: may the force plot be with you.

4 Basic Network Analysis

So far, all we've done is look at visualising networks. That is a form of analysis, hence the research questions we considered, but it has many idiosyncrasies: a plot can easily be customised in ways that are very helpful to one interpretation of results but that can also be very misleading. Further, even more rigorous network analysis and its interpretation have often led to a lot of arguments.

4.1 Example of debates on results

If any of you saw my taster session, this might be familiar. For the rest: this is a video of a presentation around a very highly cited network analysis study assessing the social contagiousness of behaviour that leads to obesity. Note this involves another dimension we haven't discussed which is unfortunately quite complex in network analysis: change over time.

4.1.1 Too impressive to be correct?

4.1.2 There's been a lot of criticism2...

  • Homophily and Contagion Are Generically Confounded in Observational Social Network Studies (Shalizi and Thomas 2011)
  • The “unfriending” problem: The consequences of homophily in friendship retention for causal estimates of social influence (Noel and Nyhan 2011)
  • Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic (Cohen-Cole and Fletcher 2008)
  • The spread of obesity in a large social network over 32 years (Mercken et al. 2010)3
  • ... and a reply: Christakis & Fowler (2013)

So it's worth once more bearing in mind that network analysis is hard, and there are issues with methods, application and interpretability. But: keeping that in mind, we shall endeavor.

4.2 Summary (Macro-ish)

We'll start with analyses that summarise the network structure. It's helpful to get an overview before drilling down to specific sections of a network. The differences between undirected and directed networks feature heavily and I'll try to cover that throughout. This is very quick and meant as hints of how this can be useful.

4.2.1 Density

One of the first and most straightforward components of quantitative assessment is density. Density is the number of connections there are between nodes divided by the number of possible connections. It's often just called network density, but the igraph method is edge_density to highlight that it's specific to edges.

edge_density(g)
## [1] 0.2597403

Note that a directed network (normally) has twice as many possible edges as an undirected one, so the same edges give half the density:

edge_density(d)
## [1] 0.1298701
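
The two densities can be checked by hand in base R. This is a sketch of the arithmetic rather than of how igraph computes it internally; the edge count of 60 is taken from the decompose(d) output further down (21 nodes with 60 edges, plus one isolated node).

```r
n <- 22  # nodes in the Star Wars network
m <- 60  # edges

possible_undirected <- n * (n - 1) / 2  # 231 unordered pairs of nodes
possible_directed   <- n * (n - 1)      # 462 ordered pairs of nodes

m / possible_undirected  # 0.2597403, matches edge_density(g)
m / possible_directed    # 0.1298701, matches edge_density(d)
```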

4.2.2 Number of components

A component is a portion of a network in which the nodes are tied to each other but not to the rest of the network. It's often referred to as a subgraph, and sometimes "the component" is shorthand for the largest connected component. Components are a way of looking at and assessing separate (because of no ties) parts of a network.

4.2.2.1 Connected Components

The first thing to consider in these cases is whether the network is connected, meaning that there is a path between every pair of nodes.

is_connected(g)
## [1] FALSE

This is applicable irrespective of whether the network is directed. In this case we have one node with no edges so we have two components.

V(g)
## + 22/22 vertices, named, from 6a96d81:
##  [1] R2-D2       CHEWBACCA   C-3PO       LUKE        DARTH VADER CAMIE      
##  [7] BIGGS       LEIA        BERU        OWEN        OBI-WAN     MOTTI      
## [13] TARKIN      HAN         GREEDO      JABBA       DODONNA     GOLD LEADER
## [19] WEDGE       RED LEADER  RED TEN     GOLD FIVE
components(g)
## $membership
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           1           1           1           1           1           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           1           1           1           1           1           1 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           1           1           1           1           1           1 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           1           1           1           2 
## 
## $csize
## [1] 21  1
## 
## $no
## [1] 2

$no indicates the number of components, $csize gives the size of each component and $membership lists which component each node is in. Note: membership is a label, not a count, and each node can only be in one component. GOLD FIVE is in component 2, not in 2 components.
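
A minimal sketch of the same output on a hypothetical four-node graph (toy), where node d plays the role of GOLD FIVE. The last line is just one possible way of pulling out the names in the smallest component.

```r
library(igraph)

# Hypothetical toy network: a, b and c are tied together, d has no edges
toy <- graph_from_literal(a - b, b - c, d)

is_connected(toy)  # FALSE, because d can't be reached

comp <- components(toy)
comp$no          # 2 components
comp$csize       # sizes 3 and 1
comp$membership  # a label per node saying which component it belongs to

# Names of the nodes in the smallest component
isolated <- names(comp$membership)[comp$membership == which.min(comp$csize)]
isolated  # "d"
```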

These results are fine for undirected networks but for directed networks things are (as ever) more complicated:

4.2.2.2 Weak connectivity in directed networks

Weak connectivity means that there is at least one path from any node in a component to any other in that component, even if it doesn't follow the direction of the edges. So if we have \(a \de b; c \de b\) but not \(c \de a\) then there is still 1 connected component.

connectivity_example <- make_graph(c("a", "b", "c", "b"))  # edges a -> b and c -> b
components(connectivity_example)
## $membership
## a b c 
## 1 1 1 
## 
## $csize
## [1] 3
## 
## $no
## [1] 1

By default components() uses weak connectivity, so the directed Star Wars network gives the same result as the undirected example:

V(d)
## + 22/22 vertices, named, from 8f6deba:
##  [1] R2-D2       CHEWBACCA   C-3PO       LUKE        DARTH VADER CAMIE      
##  [7] BIGGS       LEIA        BERU        OWEN        OBI-WAN     MOTTI      
## [13] TARKIN      HAN         GREEDO      JABBA       DODONNA     GOLD LEADER
## [19] WEDGE       RED LEADER  RED TEN     GOLD FIVE
components(d)
## $membership
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           1           1           1           1           1           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           1           1           1           1           1           1 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           1           1           1           1           1           1 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           1           1           1           2 
## 
## $csize
## [1] 21  1
## 
## $no
## [1] 2

4.2.2.3 Strong connectivity in directed networks

Strong connectivity takes the directionality of connections into account: two nodes are only in the same component if a directed path exists between them in both directions. Returning to the example above:

components(connectivity_example, mode = "strong")
## $membership
## a b c 
## 2 3 1 
## 
## $csize
## [1] 1 1 1
## 
## $no
## [1] 3

but if we make one of the connections bidirectional (\(b \leftrightarrow a\)), then we have one fewer component

connectivity_example <- add_edges(connectivity_example, c("b", "a"))
components(connectivity_example, mode = "strong")
## $membership
## a b c 
## 2 2 1 
## 
## $csize
## [1] 1 2
## 
## $no
## [1] 2

The directed Star Wars network:

components(d, mode = "strong")
## $membership
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##          22           7           6          16          12           5 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           4          13           3          21          20          14 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##          15          10           2          11           8           9 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##          19          17          18           1 
## 
## $csize
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $no
## [1] 22
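
Every character ending up in its own strong component suggests that no pair of characters can reach each other in both directions, which would be the case if the directed edges never loop back on themselves. A minimal sketch of the contrast, using two hypothetical three-node graphs:

```r
library(igraph)

chain <- graph_from_literal(a -+ b, b -+ c)          # one-way chain
cycle <- graph_from_literal(a -+ b, b -+ c, c -+ a)  # closed loop

components(chain, mode = "strong")$no  # 3: nobody can get back to where they started
components(cycle, mode = "strong")$no  # 1: everyone can reach everyone else
components(chain, mode = "weak")$no    # 1: ignoring direction the chain is one piece
```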

4.2.2.4 Extract components

Just as a basic example (feel free to extrapolate many options) you can also extract components as separate graph objects. This was demonstrated above with the induced_subgraph command, but for that we had to specify which sections of the network we wanted, which is less convenient than the decompose function:

decompose(d)
## [[1]]
## IGRAPH d2b55b6 DNW- 21 60 -- 
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from d2b55b6 (vertex names):
##  [1] C-3PO      ->R2-D2       LUKE       ->R2-D2       OBI-WAN    ->R2-D2      
##  [4] LEIA       ->R2-D2       HAN        ->R2-D2       CHEWBACCA  ->R2-D2      
##  [7] DODONNA    ->R2-D2       CHEWBACCA  ->OBI-WAN     C-3PO      ->CHEWBACCA  
## [10] CHEWBACCA  ->LUKE        CHEWBACCA  ->HAN         CHEWBACCA  ->LEIA       
## [13] CHEWBACCA  ->DARTH VADER CHEWBACCA  ->DODONNA     CAMIE      ->LUKE       
## [16] BIGGS      ->CAMIE       BIGGS      ->LUKE        DARTH VADER->LEIA       
## [19] BERU       ->LUKE        BERU       ->OWEN        BERU       ->C-3PO      
## [22] LUKE       ->OWEN        C-3PO      ->LUKE        C-3PO      ->OWEN       
## + ... omitted several edges
## 
## [[2]]
## IGRAPH 129c3c9 DNW- 1 0 -- 
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from 129c3c9 (vertex names):
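
Since decompose returns a plain list of graphs, picking out the largest component is a matter of counting nodes. A minimal sketch on a hypothetical toy graph; the names parts, sizes and largest are mine, not igraph's.

```r
library(igraph)

toy <- graph_from_literal(a - b, b - c, d)  # hypothetical: d is isolated

parts <- decompose(toy)                     # list with one graph per component
sizes <- vapply(parts, vcount, numeric(1))  # number of nodes in each

largest <- parts[[which.max(sizes)]]        # keep only the biggest component
V(largest)$name                             # a, b and c
```

The same pattern applied to the Star Wars network would drop GOLD FIVE and leave the 21 connected characters.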

We'll come back to this later.

4.2.3 Diameter

Diameter is the longest shortest path (confusing eh?). To unpack: find the shortest paths between every pair of nodes within the same component and then find the maximum of those shortest paths. Once again undirected vs directed makes a difference.
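
To unpack "longest shortest path" in code, here is a minimal sketch on a hypothetical five-node path graph: compute all the shortest path lengths with distances and then take the maximum of the finite ones.

```r
library(igraph)

path <- make_ring(5, circular = FALSE)  # a line: 1 - 2 - 3 - 4 - 5

sp <- distances(path)    # matrix of shortest path lengths between every pair
max(sp[is.finite(sp)])   # 4: the longest of the shortest paths (end to end)
diameter(path)           # 4: the same answer from the built-in function
```

The is.finite step matters for networks like ours: pairs in different components have an infinite distance and need to be left out.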

4.2.3.1 Unweighted

For ease of demonstration we will avoid the default which includes weights, first undirected:

diameter(g, weights=NA)  # Weights are included by default, more later
## [1] 3
get_diameter(g, weights=NA)
## + 4/22 vertices, named, from 6a96d81:
## [1] DARTH VADER CHEWBACCA   C-3PO       OWEN
farthest_vertices(g, weights=NA)
## $vertices
## + 2/22 vertices, named, from 6a96d81:
## [1] DARTH VADER OWEN       
## 
## $distance
## [1] 3

and then directed:

diameter(d, weights=NA)
## [1] 4
get_diameter(d, weights=NA)
## + 5/22 vertices, named, from 8f6deba:
## [1] BERU        C-3PO       CHEWBACCA   DODONNA     GOLD LEADER
farthest_vertices(d, weights=NA)
## $vertices
## + 2/22 vertices, named, from 8f6deba:
## [1] BERU        GOLD LEADER
## 
## $distance
## [1] 4

4.2.3.2 Weighted

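By default, if edges are weighted then the weights are included in distance measures, which can lead to different results. The weights are treated as distances, so a larger weight means two nodes are further apart. Once again the results vary between undirected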

diameter(g)  # Weights are included by default
## [1] 10
get_diameter(g)
## + 5/22 vertices, named, from 6a96d81:
## [1] CAMIE BIGGS C-3PO HAN   JABBA
farthest_vertices(g)
## $vertices
## + 2/22 vertices, named, from 6a96d81:
## [1] CAMIE JABBA
## 
## $distance
## [1] 10

and directed

diameter(d)
## [1] 30
get_diameter(d)
## + 4/22 vertices, named, from 8f6deba:
## [1] GREEDO HAN    LUKE   OWEN
farthest_vertices(d)
## $vertices
## + 2/22 vertices, named, from 8f6deba:
## [1] GREEDO OWEN  
## 
## $distance
## [1] 30
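
It's worth being careful about what the weights mean here: path functions add weights up as distances, so characters who share many scenes come out as further apart. A minimal sketch on a hypothetical weighted triangle; inverting the weights in the last step is one common workaround and my assumption, not something done elsewhere in this handout.

```r
library(igraph)

# Hypothetical weighted triangle: a and b are tied directly with a large weight
edges <- data.frame(from   = c("a", "b", "a"),
                    to     = c("b", "c", "c"),
                    weight = c(10, 1, 1))
toy <- graph_from_data_frame(edges, directed = FALSE)

distances(toy, v = "a", to = "b")                # 2: going via c is "shorter" than the direct tie
distances(toy, v = "a", to = "b", weights = NA)  # 1: counting steps only

diameter(toy)                # 2
diameter(toy, weights = NA)  # 1

# Treat a large weight as closeness instead by inverting it
distances(toy, v = "a", to = "b", weights = 1 / E(toy)$weight)  # 0.1
```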

4.3 Node Properties (Micro-ish)

Again, we're racing through a lot of things so there's much more to ideally cover, but we'll finish with a set of analysis options at node rather than graph level. Many of these will also assume a connected component to make useful relative comparisons.

4.3.1 Degree

Degree in a network context is how many edges a node has. Once again the undirected vs directed variation is important. Given this is an individual level calculation we have to choose a node or a set of nodes. We can begin with a degree distribution to get a flavour of what's worth focusing on

4.3.1.1 Undirected and default directed

The degree command for the undirected network will return the number of edges each node has by default. You can add a normalized parameter to express it as a proportion of the maximum possible degree, i.e. the number of other nodes (n - 1 = 21 here).

degree(g)
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           7           8          10          15           5           2 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           7          12           4           3           7           3 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           3           8           1           1           5           5 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           5           7           2           0
degree(g, normalized = TRUE)
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##  0.33333333  0.38095238  0.47619048  0.71428571  0.23809524  0.09523810 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##  0.33333333  0.57142857  0.19047619  0.14285714  0.33333333  0.14285714 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##  0.14285714  0.38095238  0.04761905  0.04761905  0.23809524  0.23809524 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##  0.23809524  0.33333333  0.09523810  0.00000000
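
The normalised values can be checked by hand: each degree is divided by the number of other nodes. A minimal base R sketch using three values copied from the degree(g) output above:

```r
n <- 22  # nodes in the Star Wars network

# A few degrees copied from the output above
deg <- c(LUKE = 15, LEIA = 12, "GOLD FIVE" = 0)

deg / (n - 1)  # 0.7142857, 0.5714286 and 0, as in degree(g, normalized = TRUE)
```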

You can also add this to a size attribute on each node to incorporate that into the plot

set.seed(3339)

V(g)$size <- degree(g)

plot(g, layout=layout_with_fr, main="Force-directed")
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"), 
       pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")

Directed graphs use the mode = 'all' by default so size of the nodes ends up being the same as the undirected example.

degree(d, mode = "all") # All is the default, but for clarity
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           7           8          10          15           5           2 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           7          12           4           3           7           3 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           3           8           1           1           5           5 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           5           7           2           0
set.seed(3339)
V(d)$size <- degree(d, mode = "all")  # use d itself (same values as g here)

plot(d, layout=layout_with_fr, main="Force-directed")
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"), 
       pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")

But there are variations...

4.3.1.2 Directed In Degree

The in degree is the number of cases where a node is receiving a connection

degree(d, mode = "in")
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           7           1           2           9           1           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           0           6           0           3           6           2 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           3           3           0           1           1           2 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           5           5           2           0

so R2-D2 has all of his edges received, while CHEWBACCA is arguably more of an extrovert.

4.3.1.3 Directed Out Degree

The out degree is the number of cases where a node is the source of a connection

degree(d, mode = "out")
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           0           7           8           6           4           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           7           6           4           0           1           1 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           0           5           1           0           4           3 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           0           2           0           0

Hence R2-D2 on 0 (seems to get all the attention, but not reciprocate) while CHEWBACCA's focus is on others... awww.
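
In and out degree tie back to the directed adjacency matrix from earlier: with rows as sources and columns as targets, out degree counts the non-empty cells in a row and in degree counts those in a column. A minimal base R sketch with hypothetical nodes a, b and c:

```r
# Hypothetical directed network: a -> b, a -> c, b -> c
nodes <- c("a", "b", "c")
adj <- matrix(0, nrow = 3, ncol = 3, dimnames = list(nodes, nodes))
adj["a", "b"] <- 1
adj["a", "c"] <- 1
adj["b", "c"] <- 1

out_degree <- rowSums(adj > 0)  # a = 2, b = 1, c = 0 (edges sent)
in_degree  <- colSums(adj > 0)  # a = 0, b = 1, c = 2 (edges received)
out_degree + in_degree          # 2 each: what mode = "all" would report
```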

4.3.2 Centrality

The last individual level thing we can cover today is centrality, of which, like everything else I've shown, there are many variations. I won't go into this too much due to time but it's a very active area of research, especially in the hard sciences. Degree and its variations are actually an example of this.

4.3.2.1 Closeness

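Closeness is the inverse of the sum of the shortest path lengths from a node to the other nodes, and it incorporates weights by default. For directed networks the mode argument sets whether paths follow outgoing edges ("out"), incoming edges ("in") or ignore direction ("all"), as with degree in the previous examples, so it's worth stating it explicitly. Note the warnings: it's worth separating the network into connected components first to avoid the distortion caused by GOLD FIVE.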

closeness(d, mode = "out")
## Warning in closeness(d, mode = "out"): At centrality.c:2617 :closeness
## centrality is not well-defined for disconnected graphs
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
## 0.002164502 0.004926108 0.005076142 0.002695418 0.003300330 0.002754821 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
## 0.007633588 0.003164557 0.005952381 0.002164502 0.002242152 0.002262443 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
## 0.002164502 0.002652520 0.002724796 0.002164502 0.003144654 0.002849003 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
## 0.002164502 0.002369668 0.002164502 0.002164502
closeness(d, mode="in")
## Warning in closeness(d, mode = "in"): At centrality.c:2617 :closeness centrality
## is not well-defined for disconnected graphs
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
## 0.003703704 0.002415459 0.002375297 0.003067485 0.002525253 0.002262443 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
## 0.002164502 0.002890173 0.002164502 0.003058104 0.003215434 0.003030303 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
## 0.003205128 0.002544529 0.002164502 0.002652520 0.002525253 0.002688172 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
## 0.004098361 0.003984064 0.004273504 0.002164502
closeness(d, mode="all")
## Warning in closeness(d, mode = "all"): At centrality.c:2617 :closeness
## centrality is not well-defined for disconnected graphs
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
## 0.009708738 0.010526316 0.011494253 0.011363636 0.010309278 0.008849558 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
## 0.012345679 0.010869565 0.010416667 0.008474576 0.010204082 0.010204082 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
## 0.009009009 0.006211180 0.005555556 0.005555556 0.012048193 0.012048193 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
## 0.010416667 0.011764706 0.010526316 0.002164502
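
To see where these numbers come from, here is a minimal sketch on a hypothetical connected three-node path, where closeness can be worked out by hand from the distance matrix (so no warning about disconnected graphs).

```r
library(igraph)

path <- make_ring(3, circular = FALSE)  # a line: 1 - 2 - 3

sp <- distances(path)        # shortest path lengths between every pair
by_hand <- 1 / rowSums(sp)   # inverse of the summed distances from each node
by_hand                      # 1/3, 1/2, 1/3: the middle node is closest to everyone

closeness(path)              # the same values from igraph
```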

4.3.2.2 Betweenness

Betweenness is a way of looking at connections with a focus on individuals' roles in the structure: it counts how many shortest paths between other pairs of nodes pass through a given node. Think back to the sociogram example: that one weak tie connected across genders, and while that tie isn't strong, it's really important to the overall structure.

betweenness(g)
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##   22.750000   15.916667   32.783333   18.333333   15.583333    0.000000 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##   31.916667   59.950000    1.666667    0.000000    0.000000    0.000000 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##    0.000000   37.000000    0.000000    0.000000   47.533333   23.800000 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##    0.000000   31.416667    2.200000    0.000000
betweenness(d)
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##    0.000000   10.500000   12.583333   19.250000    5.000000    0.000000 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##    0.000000   31.083333    0.000000    0.000000    0.000000    0.000000 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##    0.000000   15.000000    0.000000    0.000000   10.500000    3.833333 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##    0.000000   12.750000    0.000000    0.000000
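
A minimal sketch of the bridging idea on a hypothetical network of two triangles joined by a single edge. The two nodes on the bridge barely differ from the rest on degree, but they carry every shortest path between the two groups.

```r
library(igraph)

# Hypothetical network: triangle 1-2-3, triangle 4-5-6, and a bridge 3-4
toy <- make_graph(c(1, 2,  2, 3,  1, 3,
                    4, 5,  5, 6,  4, 6,
                    3, 4), directed = FALSE)

degree(toy)       # 2 2 3 3 2 2: the bridge nodes are only slightly better connected
betweenness(toy)  # 0 0 6 6 0 0: but all paths between the groups run through them
```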

4.3.3 PageRank

Last flavour: PageRank was the core idea underlying Google's initial algorithm for ranking websites based on the structure of links between them.

page_rank(g)
## $vector
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
## 0.068538690 0.086390090 0.088708430 0.185268949 0.034576040 0.013792262 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
## 0.035070288 0.086027500 0.020368818 0.018881975 0.067378471 0.016813964 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
## 0.034180007 0.114631333 0.008310156 0.008310156 0.016185680 0.017945853 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
## 0.026377242 0.034578060 0.010573836 0.007092199 
## 
## $value
## [1] 1
## 
## $options
## NULL
page_rank(d)
## $vector
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##  0.15461765  0.02432188  0.02795856  0.12644045  0.02273039  0.02509079 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##  0.02237395  0.04437007  0.02237395  0.03735639  0.08754881  0.02575659 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##  0.05924220  0.05050171  0.02237395  0.02316888  0.02273039  0.02856258 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##  0.07146324  0.04424606  0.03439756  0.02237395 
## 
## $value
## [1] 1
## 
## $options
## NULL
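
A minimal sketch of how to read the $vector part of this output, using a hypothetical star graph where every other node points at node 1. The scores behave like shares of attention: they sum to 1 and the node everyone links to gets the largest share.

```r
library(igraph)

star <- make_star(5, mode = "in")  # hypothetical: nodes 2 to 5 all point at node 1

pr <- page_rank(star)$vector
sum(pr)        # 1: scores are shares of a fixed total
which.max(pr)  # 1: the node receiving all the links ranks highest
```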

5 Lab 2

Take what we've covered and apply it to the twitter dataset.

6 What else?

6.1 There's lots (just a taster)

7 Appendices

7.1 Network Analysis Packages

  • R
  • Python (my language of choice)
    • networkx very established, not very fast for large datasets but very flexible (my package of choice for testing ideas)
    • igraph Effectively the same as the R package (both use the same underlying C library)
    • graph-tool Much newer, very speed efficient (also C++ underneath)
  • Windows
    • UCINET Windows only, developed and maintained by people including the head of the group at Manchester

References

Browne, Kath. 2005. “Snowball Sampling: Using Social Networks to Research Non-Heterosexual Women.” International Journal of Social Research Methodology 8 (1): 47–60. doi:10.1080/1364557032000081663.

Christakis, Nicholas A., and James H. Fowler. 2013. “Social Contagion Theory: Examining Dynamic Social Networks and Human Behavior.” Statistics in Medicine 32 (4): 556–77. doi:10.1002/sim.5408.

Cohen-Cole, Ethan, and Jason M. Fletcher. 2008. “Is Obesity Contagious? Social Networks Vs. Environmental Factors in the Obesity Epidemic.” Journal of Health Economics 27 (5): 1382–7. doi:10.1016/j.jhealeco.2008.04.005.

Festinger, Leon. 1954. “Who Shall Survive?” Psychological Bulletin 51 (3). US: American Psychological Association: 322–23. doi:10.1037/h0049443.

Heath, Sue, Alison Fuller, and Brenda Johnston. 2009. “Chasing Shadows: Defining Network Boundaries in Qualitative Social Network Analysis.” Qualitative Research 9 (5). SAGE Publications: 645–61. doi:10.1177/1468794109343631.

Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. 2010. “Empirical Comparison of Algorithms for Network Community Detection.” In Proceedings of the 19th International Conference on World Wide Web, 631–40. WWW ’10. Raleigh, North Carolina, USA: Association for Computing Machinery. doi:10.1145/1772690.1772755.

Mercken, L., T. A. B. Snijders, C. Steglich, E. Vartiainen, and H. de Vries. 2010. “Dynamics of Adolescent Friendship Networks and Smoking Behavior.” Social Networks 32 (1): 72–81. doi:10.1016/j.socnet.2009.02.005.

Murthy, Dhiraj. 2012. “Towards a Sociological Understanding of Social Media: Theorizing Twitter.” Sociology 46 (6). SAGE Publications Ltd: 1059–73. doi:10.1177/0038038511422553.

Noel, Hans, and Brendan Nyhan. 2011. “The ‘Unfriending’ Problem: The Consequences of Homophily in Friendship Retention for Causal Estimates of Social Influence.” Social Networks 33 (3): 211–18. doi:10.1016/j.socnet.2011.05.003.

Shalizi, Cosma Rohilla, and Andrew C. Thomas. 2011. “Homophily and Contagion Are Generically Confounded in Observational Social Network Studies.” Sociological Methods & Research 40 (2): 211–39. doi:10.1177/0049124111404820.

Tero, Atsushi, Seiji Takagi, Tetsu Saigusa, Kentaro Ito, Dan P. Bebber, Mark D. Fricker, Kenji Yumiki, Ryo Kobayashi, and Toshiyuki Nakagaki. 2010. “Rules for Biologically Inspired Adaptive Network Design.” Science 327 (5964). American Association for the Advancement of Science: 439–42. doi:10.1126/science.1177894.


  1. Even if you're in a broader *nix category there are options...

  2. See https://statmodeling.stat.columbia.edu/2011/06/10/christakis-fowl/ for some summary

  3. Full disclosure: one of my examiners was a co-author on one of these papers...